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I.  INTRODUCTION 

Under  Contract  DNA001-72“C-0057,  MAGI  has  completed  the  design  and  imple- 
mentation of  a preliminary  version  of  a Monte  Carlo  three-dimensional,  time-de- 
pendent radiation  penetration  in  complex  geometry  code  for  the  ILLIAC  IV.  The 
design  should  serve  as  a guide  to  the  programming  of  complex  Monte  Carlo  codes 
for  any  parallel  computers,  and,  with  some  modification,  for  vector  computers. 

The  implementation  was  supposed  to  show  the  feasibility  of  such  a design  and 
the  efficiency  of  the  design.  It  was  also  supposed  to  produce  a valuable  tool 
for  radiation  penetration  studies. 

Under  the  present  contract,  work  has  been  done  towards  the  inclusion  of  the 
treatment  of  neutrons  and  towards  other  improvements  of  the  code.  The  major 
effort,  however,  turned  out  in  attempts  to  make  the  preliminary  version  more  or 
less  operational  on  the  actual  ARRAY  processor  of  the  ILLIAC  IV,  rather  than 
under  translation  or  under  simulation  on  conventional  computers.  From  the  few 
short  runs  which  we  were  able  to  complete,  we  extrapolate,  for  production  runs, 
an  efficiency  of  parallelism  of  about  seventy  percent.  This  result  has  little 
statistical  significance,  but  is  consistent  with  our  original  estimate.  Other 
factors  which  reduce  the  speed  of  calculations  are  discussed  in  the  body  of  the 
report.  The  estimated  values  of  these  additional  factors  are  irrelevant  for  a 
future  computing  system  with  improved  hardware  and  systems  software. 

The  attempt  to  validate  our  method  of  parallelizing  Monte  Carlo  calculations 
were  unfortunately  performed  on  a computer  at  the  time  when  its  hardware  and  soft 
ware  were  rather  unreliable.  The  tentative  conclusion  which  we  can  draw  from 
this  attempt  is  not  in  contradiction  with  the  validity  of  our  approach. 


II.  EXTRACTING  PARALLELISM  IN  MONTE  CARLO 


The  method  we  designed  Is  fully  described  in  references  1 and  2.  It 
should  be  applicable  to  parallel  and  vector  computers  in  general,  but  was 
particularly  designed  for  radiation  penetration  calculations  on  the  ILLIAC  IV 
computer. 

The  ARRAY  processor  of  the  ILLIAC  IV  system  consists  of  sixty-four  pro- 
cessing elements  (PE's),  each  with  its  own  memory,  all  under  the  control  of  a 
single  control  unit  (CU).  Data  can  be  transferred  between  PE  memories  by  an  op- 
eration called  routing.  The  PE's  all  carry  out  the  same  operation  at  the  same 
time,  as  prescribed  by  the  CU,  each  operating  on  its  own  memory.  Individual 
PE's  can,  however,  be  disabled  for  particular  operations.  With  this  computer 
design,  and  with  perfect  code  organization,  if  it  could  be  achieved,  it  would 
be  possible  to  perform  a calculation  sixty-four  times  faster  than  on  a conven- 
tional computer  with  the  same  hardware  component  speeds. 

The  major  difficulty  with  attempting  to  implement  a Monte  Carlo  code  such 
as  SAM-CE  (2)  on  the  ILLIAC  lies  in  the  intrinsic  disorderly  nature  of  Monte 
Carlo  logic.  Although  considerably  modified  by  importance  sampling  techniques, 
a Monte  Carlo  history  is  still  a computer  simulation  of  the  physical  events 
occurring  as  a particle  (neutron  or  photon)  traverses  physical  media.  The  order 
and  the  nature  of  these  physical  events  have  little,  if  any,  correlation  from 
history  to  history.  The  naive  approach  of  following  6^4  histories  simultaneously 
is  therefore  not  feasible  as  the  parallelism  breaks  down  almost  immediately. 

Our  approach  is  to  initiate  many  histories  in  each  PE,  and  hold  all  of  them  in 
abeyance  until  any  calculation  is  required.  "Holding  the  history  in  abeyance" 
means  that  all  relevant  information  about  the  history  - or  particle  (which  we 
call  "latent  particle")  is  to  be  stored  in  computer  memory.  The  information 
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j consists  of  data  such  as  position,  direction,  energy,  time,  etc.,  and  temp- 

1 

orary  information,  the  most  important  of  which  is  the  type  of  calculation 
which  is  to  be  performed  next.  Only  if  enough  PE's  have  at  least  one  latent 
waiting  for  a given  calculation,  will  that  calculation  be  performed.  The  differ- 
ent calculations  are  performed  by  completely  independent  calculational  modules. 
Referring  to  a conventional  flow  diagram,  each  module  performs  all  the  calcula- 
tions involved  from  start  of  a branch  up  to  and  including  the  branch  point. 

I 

For  example,  a module  operating  on  latents  just  entering  collision  retrieves  and 
interpolates  the  absorption  cross  sections,  and  samples  the  absorption  prob- 
abilities. Another  module  operates  on  latents  entering  scattering  and  selects 
the  type  of  scattering  to  occur,  etc.  Each  module  operates  on  latents  of  the 
» corresponding  type,  changes  some  of  the  data  representing  the  latents,  and 

j changes  the  state  of  the  latents. 

An  executive  program  keeps  a tally  of  latent  types,  and  calls  into  execu- 
tion only  those  modules  which  have  at  least  one  latent  in  at  least  N PE's.  It 
, is  attempted  to  keep  the  efficiency  criterion  N as  high  as  possible,  but  N is 

I 

reduced  when  required. 

The  parallelism  is  extracted  in  problems  of  any  complexity.  The  cross 
section  data  representation  can  be  as  exact  as  wished.  Geometrical  details  can 
be  described  using  combinatorial  geometry^. 
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III.  VALIDATION  OF  THE  INTERIM  VERSION  OF  THE  CODE 

The  interim  version  of  the  code  is  described  in  detail  in  reference  2. 

Our  first  task  was  to  make  this  version  operational  on  the  ARRAY  processor. 

This  turned  out  to  be  the  major  task  of  the  current  project,  which  exhausted 
the  project  funds. 

Before  describing  this  effort,  it  is  necessary  to  give  a short  description 
of  the  operational  environment  at  the  Institute  for  Advanced  Computation.  It 
shouid  be  noted  that  our  operational  experience  ended  in  January,  197^,  and 
does  not  reflect  improvements  (if  any)  after  that  date. 

The  operational  system  includes  an  impressive  configuration  of  peripheral 
equipment  with  the  omission  of  convenient  input  and  output  means.  The  peri- 
pheral devices  are  driven  by  a powerful  control  language  (ACL).  The  only  means 
for  I/O  are  either  interactive  terminals  or  file  transfer  via  ARPA  network.  The 
file  transfer  works  most  of  the  time  for  card  input,  but  numerous  attempts  in 
the  course  of  twelve  months  were  not  successful  to  produce  a readable  printed 
output. 

The  system  is  peripheral  to  the  main  computing  device,  the  ARRAY  processor, 
(AP).  The  ARRAY  processor  was  down  most  of  the  time,  and  unreliable  the  remain- 
der of  the  time.  The  only  I/O  capabilities  between  the  AP  and  the  "system"  are 
blocks  of  1024  contiguous  words  in  internal  format,  with  no  supported  software 
for  conversion  from  internal  format  to  decimal  and  vice  versa. 

It  should  be  understood  that,  given  such  a system,  one  can  spend  quite  a 
bit  of  time  obtaining  results  from  a fully  operational  code.  For  instance,  a 
code  computing  2x2  sixty  four  times  will  wait  for  execution  for  a time  ranging 
from  a few  to  many  days,  and  has  a very  good  chance  to  abort  or  give  wrong 
answers.  We  engaged  in  testing  out  a code  which  is  somewhat  more  complex,  and 


which  was  partially  tested  out  only  under  translation^.  The  strategy  for 
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testing  was  to  move  selected  intermediate  answers  into  a reserved  memory  block, 
and,  after  storage  on  the  system's  disk,  examine  those  data  via  interactive 
terminal.  After  months  of  perseveration,  desperation  and  determination,  a few 
bugs  have  been  found  and  corrected.  A few  of  these  were  traced  to  the  system 
software,  and  were  kindly  corrected  by  AMES  personnel.  A few  were  due  to  differ 
ent  interpretations  of  GLYPNIR  statements  by  the  GLYPLIT  translation  and  the 
actual  ILLIAC  compiler.  The  remainder  were  logical  errors  which  did  not  show 
up  during  test  runs  under  translation.  Notwithstanding  the  fact  that  we 
strongly  suspect  that  not  all  branches  of  the  code  have  been  exercised,  the 
testing  stopped  as  soon  as  a set  of  reasonable  answers  was  obtained  for  the 
single  test  problem  we  were  nursing.  The  code  was  then  exercised  without  modi- 
fication, and  it  sometimes  reproduced  the  same  answers.  The  number  of  Monte 
Carlo  histories  was  then  increased  to  a few  hundred  and  the  code,  after  a few 
bad  runs,  calculated  answers  which  appear  to  be  right  - the  computed  flux  being 
consistent  with  the  analytical  flux,  within  the  computed  standard  deviation. 
Several  attempts  to  run  the  code  for  a few  thousand  histories  were  all  unsuccess 


IV.  ANALYSIS  OF  EFFICIENCY 


During  the  design  stages  of  the  program,  we  predicted*’^  that  the  ILLIAC 
version  of  the  code  will  run  twenty  times  faster  than  the  CDC  6600  computer. 

We  repeat  the  arguments  quoted  in  Section  9 of  reference  1.  On  the  basis  of 
learned  estimates,  we  predicted  that  useful  calculations  will  be  performed  with 
an  average  PE  utilization  (e)  = 0.7  to  0,8.  We  recognized  the  fact  that,  to 
achieve  that  degree  of  parallelism,  there  is  a heavy  overhead  associated  with 
routing,  searching,  etc.,  with  no  equivalent  on  a conventional  computer.  This 
overhead  reduces  the  speed  of  calculations  by  a factor  f which  we  guesstimated 
to  be  of  the  order  of  0.5.  Finally,  the  speed  of  calculation  Sp^  of  a single 
PE  was  taken  to  be  equal  to  the  hardware  design  speed:  Sp^^S^^,  where  is 
the  actual  speed  of  the  CDC  6600  computer. 

The  overall  speed  of  calculation  in  the  ILLIAC-4  is  equal  to: 

S|4  = e.f.6A  ($^^^66 
00 

Substituting  our  estimates  of  e,  f,  and  Sp^/S^^ 

Sn  20  (2) 

Our  (limited)  experience  with  the  "running"  version  of  the  code  indicates 
a speed  drastically  smaller  than  that  given  by  Equation  (2).  Although  this  re- 
sult Indicates  that  the  current  version  of  the  code,  run  with  the  current  state 
of  hardware  and  software  is  uneconomical,  it  is  important  to  attempt  to  analyze 
the  different  factors  of  Equation  (1),  as  these  may  be  relevant  for  the  design 
of  a variety  of  Monte  Carlo  codes  for  a variety  of  parallel  or  pipeline  com- 
puters, including  even  an  upgraded  ILLIAC. 
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1 . The  P-E  Utilization  Factor  e 

This  is  the  most  interesting  factor  from  the  point  of  view  of  code  design. 
The  estimate  of  the  factor  has  been  arrived  at  in  the  following  manner. 

As  described  in  reference  2,  the  original  code  has  a "debug"  feature. 
Before  calling  into  execution  any  operational  or  main  overhead  module,  the 
executive  program  examines  a logical  variable  DEB.  If  true,  it  first  calls  a 
subroutine  PRI  and  then  the  module.  The  subroutine  was  originally  designed  to 
print  (in  the  translated  version)  the  relevant  information  on  module  input. 

The  routine  has  been  completely  modified.  Instead  of  "printing"  anything,  it 
now  tallies  the  number  of  calls  and  the  PE  utilization.  The  tallies  can  be  ex- 
amined at  the  end  of  the  run,  and  statistical  information  on  the  efficiency  can 
be  extracted. 

As  explained  in  reference  2,  the  code  has  been  designed  to  keep  the  effi- 
ciency up  while  many  histories  are  processed  in  parallel.  The  efficiency  is 
expected  to  deteriorate  towards  the  end  of  the  run  when  the  processing  of  the 
bulk  of  histories  has  been  completed,  and  only  a few  histories  remain  to  be 
processed  to  their  random  end.  This  deterioration  is  expected  to  have  a negli- 
gible effect  on  the  overall  efficiency  for  normal  production  runs  which,  as  a 
rule,  involve  at  least  several  thousand  histories.  As  mentioned  before,  we 
were  not  able  to  complete  any  such  long  runs.  For  runs  consisting  of  only  a 
few  hundred  histories,  we  observed  a rather  low  overall  efficiency,  and  attri- 
buted this  fact  to  the  "dying  stages"  of  the  runs.  In  order  to  learn  anything 
about  efficiency  for  production  runs,  we  simply  terminated  the  tally  during 
the  dying  stages  of  the  run.  The  results  we  quote  below  are  based  on  these 
truncated  tallies. 
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A grand  average  over  all  tallies  for  all  operational  modules  indicates 
an  overall  efficiency  of  parallelism  of  0.72,  which  is  consistent  with  our 
original  estimate  - a rewarding  result. 

The  efficiency  of  parallelism  for  individual  modules  ranges  from  0.9  to 
0.1,  the  lower  efficiency  corresponding  to  modules  less  frequently  called  into 
execution.  This  property  was  built  into  the  design  of  the  code. 

As  discussed  in  reference  2,  the  algorithi.s  utilized  by  the  executive 
program  to  keep  the  efficiency  up,  can  be  improved.  No  major  effort  in  that 
direction  can  be  justified  if  the  efficiency  is  indeed  70?.  We  repeat,  however, 
that  this  result  has  been  shown  only  for  a single  test  proi,’em.  The  test  prob- 
lem is  such  that  minimal  amounts  of  memory  are  necessary  for  i.put  data  (geometry 
and  cross  sections).  The  room  available  for  latents  and  minilatents  is  therefore 
maximal  (25  of  each  in  every  P.E.).  If  that  room  is  reduced,  the  efficiency 
is  expected  to  decrease,  and  some  optimization  of  the  executive  may  become  de- 
s i rable. 

2.  Processing  Element  Speed 

Being  aware  of  the  fact  that  the  ILLIAC-IV  was  operating  at  reduced  speed 
and  with  instruction  overlap  suppressed,  we  decided  to  undertake  a fair  evalu- 
ation of  the  factor  For  that  purpose,  we  made  up  a problem  where 

both  factors  e and  f are  exactly  unity,  and  where  routing  is  not  invoked.  The 
problem  is  to  evaluate  the  sum  of  1000  random  numbers  for  6^  different  sequences. 
As  the  generation  of  a random  number  is  achieved  by  multiplication,  it  is  com- 
pletely parallelized  (see  Appendix  A of  reference  2).  Comparing  running  times 
on  the  ILLIAC-IV  and  on  a CDC  6600  computer,  we  obtain  = 6.02.  Substi- 

tuting that  result,  together  with  e = f = 1 into  Equation  (1)  we  obtain 

Se/^ee  = 0.Q3k 

instead  of  unity,  as  we  assumed. 
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3.  Overhead  Factor 

In  its  current  status  the  interim  code  includes  a number  of  consistency 
checks,  tallies,  interrupts,  which  are  necessary  in  the  current  operational 
environment.  This  contributes  to  the  overhead  and  therefore  decreases  the 
f-factor,  which  is  currently  estimated  to  be  of  the  order  of  0.1.  To  deter- 
mine the  different  components  of  the  overhead,  one  would  have  to  time  separately 
operational  and  overhead  modules  in  the  course  of  Monte  Carlo  execution.  This 
would  mean  building  in  interrupts,  which  would  themselves  affect  the  execution 
time.  A less  ambitious  analysis  has  been  performed.  The  tallies  referred  to 
in  subsection  1 above  include  the  number  of  calls  to  the  main  overhead  routines. 
The  computer  time  spent  by  these  routines  depends  very  much  on  the  actual  status 
of  latents  and  minilatents.  We  "guessed"  a typical  status,  and  timed  the  exe- 
cution of  these  routines  outside  the  Monte  Carlo  code.  The  only  reliable  result 
we  obtained  from  this  study  is  that  mini  latent  routing  is  the  chief  contributor 
to  the  overhead. 
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CODE  improvements 


The  interim  version  of  the  code  included  a complete  treatment  of  radia- 
tion penetration  by  the  Monte  Carlo  method.  It  was  designed'  to  treat  both 

2 

neutron  and  gamma  radiatipn,  but  was  implemented  only  for  gamma  rays.  The 
version  included  no  input/output  capabilities,  the  "input"  for  a singfe  test 
problem  being  coded-in  as  replacement  statements.  Considerable  effort  has  been 
( applied  to  complete  the  implementation  in  all  these  areas.  The  effort  has  been 

dropped  before  completion.  A minor  effort  to  improve  the  overall  efficiency 

of  the  interim  version  was  partially  successful. 

t 

I 1 . Improvements  in  Parallelism 

I As  described  in  reference  2,  the  executive  program  implements  given  al- 

['■ 

} gorithms  for  maintaining  a high  degree  of  efficiency.  The  algorithms  differ 

^ ^ in  degree  of  sophistication  depending  on  the  type  of  operational  module  in- 

i volved.  In  Section  8.3  of  reference  2,  it  is  recommended  that  all  algorithms 

{ 

r be  changed  to  conform  with  the  best  one.  This  has  been  successfully  done  in 

i the  ILLIAC  version.  The  results  of  our  efficiency  study  (Section  IV)  were  ob- 

tained with  the  improved  version. 

2.  Reduction  in  Overhead 

As  discussed  in  Section  IV. 3,  it  is  felt  that  the  major  contribution  to 
the  high  overhead  in  the  operation  of  the  code  can  be  ascribed  to  the  routing  of 
minilatents. 

Subroutine  MINIRT  (described  in  Section  7.2  of  reference  2)  has  been  com- 
pletely redesigned  for  greater  efficiency.  Attempts  to  make  the  new  version 
operational  were,  however,  unsuccessful. 


3.  Incorporation  of  Input/Output  Capabilities 

The  bulk  of  the  input  consists  of  geometry  data  and  of  cross  section  data. 

Both  sets  of  data  are  (successfully)  preprocessed  in  a conventional  computer. 

2 

The  data  are  organized  according  to  the  designed  layout  of  the  ARRAY  memory, 
and  transferred  to  AMES  via  ARPA  network.  Before  loading  on  the  ARRAY  processor, 
these  files  need  to  be  translated  to  ILLIAC  binary  format.  None  of  the  numerous 
attempts  to  use  (unreleased  and  unsupported)  system  software  routines  to  perform 
that  translation  were  completely  successful. 

The  bulk  of  the  output  consists  of  fluxes  averaged  over  phase  space  regions. 
This  output  is  transferred  onto  the  system's  mass  storage.  It  can  be  success- 
fully examined  interactively  using  an  (unreleased  and  unsupported)  system  soft- 
ware routine.  Several  steps  are  needed  to  obtain  a printed  formatted  output. 

The  first  step  is  to  translate  the  files  from  ILLIAC  binary  to  another  repre- 
sentation (either  decimal  or  host-binary).  Numerous  attempts  to  use  (unreleased 
and  unsupported)  system  software  routines  were  unsuccessful.  The  next  steps 
were  not  even  attempted:  file  transfer  to  a host  computer,  and  formatted  output 
by  a (trivial)  code  (not  yet  developed)  operating  on  that  host  computer. 

Transfer  of  ASCII  files  from  AMES  to  host  were  achieved  with  partial  success 
but  with  enough  difficulties  to  warrant  an  indefinite  postponement  of  this  effort. 
A.  Inclusion  of  Neutronics 

It  has  been  anticipated  that  this  task  would  be  the  major  one  of  the  project. 
Work  on  this  task  started  in  parallel  with  the  installation  of  the  code.  The 
task  had,  however,  been  terminated  before  completion  in  order  to  allow  work  to 
proceed  on  the  validation  of  the  interim  version. 

In  reference  1,  we  proposed  a complete  design  of  the  treatment  of  neutron 
Interactions.  The  design  was  aimed  at  optimizing  the  efficiency  of  the  data 
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retrieval  modules.  The  design  required  a specific  memory  layout  of  cross 

section  data  - it  is  given  in  tables  3,  ^ and  5 of  reference  1.  Some  waste  of 

memory  can  be  observed  in  the  storage  of  angular  distributions  (table  k)  and 

2 

energy  distributions  (table  5).  Our  experience  with  the  detailed  design  of 
the  code  pointed  out  the  importance  of  reducing  the  memory  requirements.  The 
memory  layout  has  therefore  been  completely  rearranged  in  the  case  of  angular 
and  energy  distributions  (at  the  burden  of  increasing  running  time  in  data  re- 
t r i eva 1 ) . 

A Fortran  program  has  been  written  to  produce  such  a layout.  The  input  is 
a "universal"  cross  section  library.  The  program  has  been  successfully  tested 
out  in  several  (but  not  all  possible)  situations. 

I The  ILLIAC  retrieval  routines  have  been  all  coded  up  and  underwent  some 

testing.  The  testing  has  not  been  completed.  The  subroutines  have  not  been  in- 
corporated into  the  Monte  Carlo  code. 

i 


I 

I 

I 


VI.  CONCLUSIONS 

The  attempt  to  validate  our  method  of  parallelizing  Monte  Carlo  calcula- 
tions were  unfortunately  performed  on  a computer  at  the  time  when  its  hardware 
and  software  were  rather  unreliable.  The  tentative  conclusions  which  we  can 
draw  from  this  attempt  are  not  in  contradiction  with  the  validity  of  our  approach. 
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